
PathLike wrapper/cache for ExternalStorage #186

Open · wants to merge 64 commits into base: main

Conversation

dwhswenson (Member) commented Apr 26, 2023

This is a start to solving #120. We've identified the following goals:

  1. We want to be able to use an arbitrary external backend for shared (as well as permanent), which we already have via ExternalStorage, instead of requiring it to be a directory on the same host.
  2. We want the user experience to mirror conveniences of pathlib.Path, which will (we hope) minimize frustration with the transition for users.

So far, this PR doesn't integrate any of this into the execution process. That can happen either as part of this PR or in a future one.

The basic approach here involves 2 classes:

  • SharedRoot: represents the root directory for a given unit. By default, this will exist at $SCRATCH/.holding/$UNIT_LABEL. This is where most of the actual logic is. It contains references to all the paths that have been registered with it. On deletion (or on demand) this transfers all paths registered with it to the ExternalStorage.
  • SharedPath: represents any path within SharedRoot. Contains information on the actual file path on the system, as well as the label used as the key for the external key-value store. Creation is typically from root / "some_path" or some_path / "subpath". A SharedPath can be used like any other PathLike in open, and can be converted via pathlib.Path(shared_path) to get all the functionality of pathlib (see the sketch below).
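To make the intended user experience concrete, here's a minimal, self-contained toy (not this PR's implementation; ToySharedPath is invented purely for illustration) of what "PathLike that works with open and converts via pathlib.Path" means in practice:

import os
import pathlib
import tempfile

class ToySharedPath:
    """Toy stand-in for the SharedPath behaviour described above."""
    def __init__(self, fspath, label):
        self._path = pathlib.Path(fspath)  # actual location on the local filesystem
        self.label = label                 # key used in the external key-value store

    def __truediv__(self, other):
        return ToySharedPath(self._path / other, f"{self.label}/{other}")

    def __fspath__(self):
        return os.fspath(self._path)

root = ToySharedPath(tempfile.mkdtemp(), label="dag_label/unit1")
target = root / "some_path.txt"          # creation via root / "some_path"
with open(target, mode="w") as f:        # usable anywhere a PathLike is accepted
    f.write("hello")
print(pathlib.Path(target).read_text())  # convert for the full pathlib functionality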

As I'm opening this, code is still in progress. But at least the tests should give a picture of usage; please take a look!


Update for some final to-do items.

  • Remove prefix; I believe it is no longer used
  • Reconsider when dag_label needs to be given
  • Switch delete_empty_dirs to keep_empty_dirs (consistent with other keep_*)
  • Check special cases where a directory gets registered as a StagedPath (due to __fspath__ getting called)
  • StagingDirectory => StagingRegistry
  • Re-read docs and comments for accuracy
  • Make StagingRegistry objects non-PathLike (potential footgun on os.path.join(context.shared, "filename.txt"); illustrated below).
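For reference, a runnable illustration of that os.path.join footgun (Registry below is a toy stand-in, not the actual StagingRegistry):

import os

class Registry:
    # any object with __fspath__ is PathLike, so os.path.join happily consumes it...
    def __fspath__(self):
        return "/tmp/staging"

shared = Registry()
joined = os.path.join(shared, "filename.txt")
print(joined, type(joined))  # '/tmp/staging/filename.txt' <class 'str'> -- a bare
                             # string that bypasses any registration bookkeeping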

codecov bot commented Apr 26, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (a426dce) 99.21% compared to head (8489c24) 99.15%.

Files | Patch % | Lines
gufe/storage/stagingregistry.py | 97.81% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #186      +/-   ##
==========================================
- Coverage   99.21%   99.15%   -0.06%     
==========================================
  Files          36       38       +2     
  Lines        1907     2128     +221     
==========================================
+ Hits         1892     2110     +218     
- Misses         15       18       +3     


Comment on lines 26 to 30
read_only : bool
    write to prevent NEW files from being written within this shared
    directory. NOTE: This will not prevent overwrite of existing files,
    in scratch space, but it will prevent changed files from uploading
    to the external storage.
dwhswenson (Member Author):

This read_only is mainly to prevent the following mistake:

shared = root.get_other_shared(other_unit_prefix)
with open(shared / "oldfile.txt") as f:
    ... # good so far!

with open(shared / "newfile.txt", mode='wb') as f:
    ... # uh-oh -- now we're writing into the wrong unit's space!

However, it won't stop open(shared / "oldfile.txt", mode='wb'); you can still overwrite that data. If that data is also in a separate permanent storage, that's fine -- you only overwrite the local copy. (Of course, if you set shared up so that the directory it uses is identical to the internal holding cache here, it would overwrite that as well.)

Comment on lines 94 to 99
def __del__(self):
    # take everything in self.shared_dir and write it to shared, keeping
    # our prefix
    self.transfer_holding_to_external()
    if self.delete_holding:
        shutil.rmtree(self.shared_dir)
Contributor:

iirc __del__ isn't called on del obj but rather when the object is actually garbage collected. Maybe instead make transfer_holding_to_external always explicit and don't rely on garbage-collection behaviour.
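For illustration, a minimal runnable example of that distinction (a toy class, nothing from this PR):

class Holder:
    def __del__(self):
        print("__del__ called")

a = Holder()
b = a    # a second reference to the same object
del a    # prints nothing: this only removes the name and drops the refcount
del b    # in CPython the last reference is gone, so __del__ runs here; other
         # interpreters may defer it until a later garbage-collection pass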

dwhswenson (Member Author):

works for me ... it's an executor-level thing anyway, and I don't mind asking the executor writer to think about how files get moved around (don't want to ask the protocol writer to think about that)

Contributor:

Or maybe, if you want this behaviour, use a context manager: instead of __del__, use __exit__.

dwhswenson (Member Author):

This started out as a context manager. I switched away from that because it makes actual usage kind of weird: the entire execution block where this is used gets indented by one level, which I don't particularly like. It gets worse if you have one of these for shared and one for permanent; there's a syntax for setting multiple context managers at once (shown below), but I've always found that kind of weird too.
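For reference, the "multiple context managers at once" syntax looks like this (staging here is a throwaway toy, not one of this PR's classes):

from contextlib import contextmanager

@contextmanager
def staging(name):
    print(f"enter {name}")
    try:
        yield name
    finally:
        print(f"cleanup {name}")

# both context managers in a single with-statement
with staging("shared") as shared, staging("permanent") as permanent:
    print("run the unit using", shared, "and", permanent)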

But I can switch it back to a context manager if you want 🤷‍♂️

gufe/storage/pseudodirectory.py: 3 resolved review comments (outdated)
dwhswenson (Member Author) commented:

Still in progress, but here's a quick update to what is included in this PR now. Before I dive any further into writing unit tests, take a quick look, @richardjgowers (at least at the contents of this comment).

Main idea: We have "staging" directories that are local. From a protocol author's perspective, shared, permanent, and scratch are PathLikes. However, the paths pointed to by shared and permanent are in the local scratch space, and the files there are transferred to external storage as part of the executor.

I've expanded this PR to include some tools for handling the entire scratch / shared / permanent storage lifecycle. There are two context managers: a unit-level one, which provides the PathLikes used by the protocol author and handles cleanup when the unit finishes, and a dag-level one, which provides the object that creates the unit-level context manager and handles cleanup when the dag finishes.

Here's an example from a protocol author's viewpoint (kind of faking our real units, but this gives the rough picture):

class Unit1:
    key = "unit1"

    def run(self, scratch, shared, permanent):
        (scratch / "foo.txt").touch()
        with open(shared / "bar.txt", mode='w') as f:
            f.write("bar was written")
        with open(permanent / "baz.txt", mode='w') as f:
            f.write("baz was written")
        return "done 1"


class Unit2:
    key = "unit2"

    def run(self, scratch, shared, permanent):
        (scratch / "foo2.txt").touch()
        with shared.other_shared("unit1") as prev_shared:
            with open(prev_shared / "bar.txt", mode='r') as f:
                bar = f.read()
            # note that you can open a file from permanent as if it was
            # from shared -- everything in permanent is in shared
            with open(prev_shared / "baz.txt", mode='r') as f:
                baz = f.read()
        return {"bar": bar, "baz": baz}

For the executor, the basic picture is something like this:

manager = StorageManager(scratch, shared_external, permanent_external)
with manager.running_dag(dag_label) as dag_ctx:
    for unit in ordered_dag_units:
        with dag_ctx.running_unit(unit.key) as (scratch, shared, permanent):
            unit_result = unit.run(scratch, shared, permanent)  # or however we run the unit
        # when this context ends, scratch will be deleted, shared external storage will be
        # updated, and shared holding will be deleted
# when this context ends, permanent external storage will be updated, and permanent holding 
# will be deleted

For much more detail, see this lifecycle test:

def test_lifecycle(request, manager, dag_units):
    # heavy integration test to ensure that the whole process works
    # this is the primary test of _DAGStorageManager
    storage_manager = request.getfixturevalue(f"storage_manager_{manager}")
    permanent_root = storage_manager.permanent_root
    shared_root = storage_manager.shared_root
    results = []
    unit1_dir = Path(storage_manager.get_shared("dag_label", "unit1"))
    scratch1 = Path(storage_manager.get_scratch("dag_label", "unit1"))
    scratch2 = Path(storage_manager.get_scratch("dag_label", "unit2"))
    barfile = unit1_dir / "bar.txt"
    bazfile = unit1_dir / "baz.txt"
    foofile = scratch1 / "foo.txt"
    foo2file = scratch2 / "foo2.txt"
    all_files = {barfile, bazfile, foofile, foo2file}
    with storage_manager.running_dag("dag_label") as dag_ctx:
        for unit in dag_units:
            with dag_ctx.running_unit(unit.key) as (scratch, shared, permanent):
                results.append(unit.run(scratch, shared, permanent))
                # check that the expected files are found in staging
                exists = {
                    "unit1": {barfile, bazfile, foofile},
                    "unit2": {foo2file, bazfile}
                }[unit.key]
                for file in exists:
                    assert file.exists()
                for file in all_files - exists:
                    assert not file.exists()
                # check that shared store is as expected
                expected_in_shared = {
                    "unit1": set(),
                    "unit2": {"unit1/bar.txt", "unit1/baz.txt"}
                }[unit.key]
                assert set(shared_root.iter_contents()) == expected_in_shared
                # check that permanent store is empty
                assert list(permanent_root.iter_contents()) == []
            # AFTER THE RUNNING_UNIT CONTEXT
            # Same for both units because unit2 doesn't add anything to
            # shared/permanent
            # Files staged for shared should be transferred to shared and
            # removed from the staging directories; files staged for
            # permanent should remain
            for_permanent = {bazfile}
            for file in for_permanent:
                assert file.exists()
            for file in all_files - for_permanent:
                assert not file.exists()
            # check that we have things in shared
            expected_in_shared = {"unit1/bar.txt", "unit1/baz.txt"}
            assert set(shared_root.iter_contents()) == expected_in_shared
            # ensure that we haven't written to permanent yet
            assert list(permanent_root.iter_contents()) == []
    # AFTER THE RUNNING_DAG CONTEXT
    # all staged files should be deleted
    for file in all_files:
        assert not file.exists()
    # shared still contains everything it had; but this isn't something we
    # guarantee, so we don't actually test for it
    # assert set(shared_root.iter_contents()) == {"unit1/bar.txt",
    #                                             "unit1/baz.txt"}
    assert list(permanent_root.iter_contents()) == ["unit1/baz.txt"]
    # check the results
    assert results == [
        "done 1",
        {"bar": "bar was written", "baz": "baz was written"}
    ]

Next, tests will check the special cases where shared or permanent is the same directory as the staging directory, or as each other.

dotsdl (Member) commented Jun 2, 2023

From power hour:

         with shared.other_shared("unit1") as prev_shared: 
             with open(prev_shared / "bar.txt", mode='r') as f: 
                 bar = f.read() 

We'd prefer not to have shared.other_shared("unit1"), which would allow e.g. unitX to access unitA's shared space even if unitA wasn't an explicit dependency of unitX.

Instead, having unitX access unitA's shared via the ProtocolDAGResult object passed into run would avoid this issue (a hypothetical sketch follows).
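A hypothetical sketch of what that might look like (the previous_results argument and its shape are invented for illustration; this is not an existing gufe API):

class Unit2:
    key = "unit2"

    def run(self, scratch, shared, permanent, previous_results):
        # previous_results stands in for whatever the executor derives from the
        # ProtocolDAGResult of the units this unit explicitly depends on
        bar_path = previous_results["unit1"]["bar.txt"]
        with open(bar_path, mode='r') as f:
            bar = f.read()
        return {"bar": bar}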

protocol authors. In detail, this provides protocol authors with
``PathLike`` objects for ``scratch``, ``shared``, and ``permanent``. All
three of these objects actually point to special subdirectories of the
scratch space for a specific unit, but are managed by context manangers at
Member:

Suggested change:
- scratch space for a specific unit, but are managed by context manangers at
+ local execution space for a specific unit, but are managed by context managers at

:class:`.DAGContextManager` to create a context to run a unit. That context
creates a :class:`.SharedStaging` and a :class:`.PermanentStaging`
associated with the specific unit. Those staging directories, with the
scratch directory, are provided to the :class:`.ProtocolDAGUnit`, so that
Member:

Suggested change:
- scratch directory, are provided to the :class:`.ProtocolDAGUnit`, so that
+ scratch directory, are provided to the :class:`.ProtocolUnit`, so that

pep8speaks commented Sep 19, 2023

Hello @dwhswenson! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-01-22 17:35:13 UTC

richardjgowers (Contributor) left a comment:

this is partial; I'm still going to go through the tests and try breaking this locally myself

Comment on lines +30 to +31
network-scale analysis. Anything stored here will be usable after the
calculation has completed.
Contributor:

"after the calculation" could equally apply to a Unit. Maybe instead "Anything stored here will be retrievable after the Protocol estimation has completed"

Comment on lines +13 to +15
In an abstract sense, as used by protocol developers, these three levels
correspond to three lifetimes of data:

Contributor:

Another last chance to rename things: what if we instead named the storage lifetimes after the thing they're scoped to? So scratch -> unit_storage, shared -> dag_storage, and permanent -> campaign_storage. The idea being it's easier to remember their scope.

the end of the :class:`.ProtocolDAG`, but is not guaranteed to exist after
the :class:`.ProtocolDAG` terminates.
* ``permanent``: This is the data that will be needed beyond the scope of a
single rough estimate of the calculation. This could include anything that
Contributor:

not sure I like "single rough estimate" here; a single estimate might be perfectly fine. Maybe instead: "this is the data that will be available for post-simulation analysis beyond the scope of a single DAG"


The ``scratch`` area is always a local directory. However, ``shared`` and
``permanent`` can be external (remote) resources, using the
:class:`.ExternalResource` API.
Contributor:

maybe mention that whilst these might not be a local directory, they will still act like one?

As a practical matter, the GUFE storage system can be handled with a
:class:`.StorageManager`. This automates some aspects of the transfer
between stages of the GUFE storage system, and simplifies the API for
protocol authors. In detail, this provides protocol authors with
Contributor:

can we use "developers" instead of "protocol authors"? I'm thinking "developer" is more common; someone might get confused and think "protocol author" is the agent submitting protocols or something daft

Comment on lines +111 to +112
"""Transfer a given file from staging into external storage
"""
Contributor:

if I've followed this, it returns the held_file back if it was transferred, but otherwise None? Can store_path ever (knowingly) fail?

dwhswenson (Member Author):

> if I've followed this, it returns the held_file back if it was transferred, but otherwise None?

Yes. The real point here is that transfer_staging_to_external, which tries to transfer everything, returns the list of files that have been transferred. That was needed because it turned out that the successfully transferred files are needed by the StorageManager if keep_shared is False. I might be able to move that responsibility into a delete_transferred method on StagingRegistry, though.

> Can store_path ever (knowingly) fail?

I'm not sure what you mean here by "knowingly." If store_path does not successfully store the path, it should raise an exception. The only way it can communicate failure is via exception, and if it fails silently, that's a bug.

"""
other = self._get_other_shared(prefix, delete_staging)
yield other
other.cleanup()
Contributor:

Could we instead put __enter__/__exit__ etc. on SharedStaging? It feels strange to have some of the class's intended behaviour defined in this method.

Contributor:

So other_shared here would just be `return SharedStaging(...)` (sketched below).
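A toy illustration of that shape (ToyStaging is a stand-in, not this PR's SharedStaging):

class ToyStaging:
    def __init__(self, prefix):
        self.prefix = prefix

    def cleanup(self):
        print(f"cleaning up {self.prefix}")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.cleanup()

def other_shared(prefix):
    # caller writes: with other_shared("unit1") as prev_shared: ...
    return ToyStaging(prefix)

with other_shared("unit1") as prev_shared:
    print("reading from", prev_shared.prefix)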

Comment on lines 52 to 58
self.shared_staging = SharedStaging(
    scratch=self.scratch_root,
    external=self.shared_root,
    staging=self.staging,
    delete_empty_dirs=delete_empty_dirs,
    prefix=""  # TODO: remove prefix
)
Contributor:

why is shared scoped to the lifetime of the StorageManager? Could you not create this inside the scope of the running_dag method?

from .stagingregistry import StagingPath # typing


class StorageManager:
Contributor:

doccing this with some usage examples would be handy

dwhswenson and others added 8 commits (December 7, 2023 17:07). Commit message excerpts:

  • This required a pretty significant rewrite of the code.
  • Mainly matters for the StorageManager
  • Make a function that returns a pathlib.Path be the main thing, instead of the property that returns a string.
  • Almost all usage of `.fspath` was to then wrap it in a pathlib.Path. This is more convenient for users.
  • This is to avoid likely footguns related to using os.path.join. Includes test of error on os.path.join.
network, are not recommended for ``shared``.


Details: Manangement of the Storage Lifetime
Contributor:

typo: management
